ABSTRACT

Developing memory systems to support high-speed processors is a major challenge to
computer architects. Cache memories can improve system performance but the latency of main
memory remains a major penalty for a cache miss. A novel approach to improve system
performance is the use of a memory prediction buffer. The memory prediction buffer (MPB) is
inserted between the cache and main memory. The MPB predicts the next cache-miss address and
prefetches the data. The use of an MPB in a computer system is shown to decrease main-memory
latency and increase system performance.
I. INTRODUCTION


The  technological  advances  in  high-speed,  general  purpose  processors  have  outpaced  the 
support  provided  by  main  memory  systems.  In  addition,  software  applications  continue  to  grow  in 
processor  and  memory  requirements.  The  major  factors  in  the  design  of  memory  systems  are  size 
of  address  space,  bandwidth  required,  main-memory  latency,  and  memory  subsystem  cost.  Large 
memory  subsystems  use  dynamic  random-access  memories  because  of  their  low  cost  per  bit. 
Caching schemes, which employ high-cost, high-speed memories, are used to overcome
main-memory latency and increase bandwidth. However, main-memory latency, which is the time (in
processor cycles) between the start of a memory fetch and the start of the transfer of requested data,
is significant and increasing [PRZYBY90]. Further gains in memory system performance are
possible  through  the  use  of  different  manufacturing  processes  (CMOS,  BiCMOS,  ECL  and  GaAs) 
[VAGTS92]  and  stringent  design  of  the  memory  hierarchy.  One  such  memory  performance 
enhancement  is  the  prediction  of  a  cache-miss  read  address  request  to  main  memory.  If  the  read 
address  is  predicted  and  the  data  made  available,  then  the  overall  system  performance  is  improved. 

Since current RISC processors far exceed the capability of main memory systems, the focus
for the computer systems architect is how to improve the performance of the memory hierarchy.
Large, fully-associative caches are cost prohibitive, and direct-mapped caches offer an excellent
alternative [HILL88]. However, direct-mapped caches have a higher miss rate than fully-associative
or set-associative caches. A disadvantage of cache memories, in general, is the miss
penalty [PATHEN90],[PRZYBY90]. The reduction of the miss rate and the subsequent miss penalty
is the motivation for the memory prediction buffer (MPB).

Conceptually, the MPB is an enhancement for the data cache. The behavior of processors
utilizing separate data and instruction caches is noted in this research and
others [JOUPPI90],[PRZYBY90]. Examination of this behavior shows that instruction caches and
data caches behave differently. Instruction caches can improve effectiveness by simply prefetching
the next instruction. This approach is shown to be less effective for data caches
[PATHEN90],[JOUPPI90]. If this approach is used for data cache management, it contributes to
pollution of the cache and increases the number of capacity misses. Since most modern RISC
processors have separate instruction and data caches, and employ some prefetch mechanism for the
instruction cache, this research focuses on improving the effectiveness of the data cache by
inserting an MPB between the cache and its refill line (main memory, in most cases). Although this
organization is the focus of this research, it is not the only implementation possible for the
MPB [NOWICK92].
II. MEMORY HIERARCHY AND LATENCY REDUCTION


The von Neumann architecture, used by most single-instruction-single-data¹ (SISD) and
single-instruction-multiple-data (SIMD) machines, has some baseline behavioral characteristics to
consider  [HWANG84].  The  characteristics  of  the  memory  subsystem  provide  the  parameters  for 
optimization  of  the  operational  behavior  of  the  memory  subsystem  in  conjunction  with  the 
processor  and  secondary  storage.  First,  stored  programs  obey  the  principle  of  locality 
[PATHEN90].  This  principle  has  two  components  which  state  that  programs,  while  executing, 
favor  only  a  portion  of  their  address  space  at  a  given  instant.  The  two  components  are: 

*  Spatial  Locality  -  Programs  tend  to  request  data  and  instructions  that  have  memory 
addresses  near  the  instructions  and  data  currently  being  used.  The  von  Neumann 
architecture  provides  for  the  execution  of  sequential  program  instructions  and  programs  use 
related  data  items  which  are  likely  to  be  adjacently  stored. 

*  Temporal Locality - Programs tend to use current information and data. That is, if an item is
referenced, it will probably be referenced again soon. The older the information, the less
likely it is that the program will again reference it. Temporal locality is especially evident in
the execution of program loops where instructions and data are used several times within a
short period of time.

With reference to these principles, high-speed buffers are inserted between the main memory
and the processor. These buffers are known as caches. The caches store portions of main memory
which are currently in use by the executing program. This allows rapid access by the processor to
the instructions and data needed to continue processing. Although the cache is effective at
hiding main-memory latency, a disadvantage of its use is the penalty for a cache miss. The
construction of the cache gives rise to the following categories of cache miss:

*  Compulsory  -  cache  misses  that  occur  when  a  block  is  first  accessed  and  the  program  is  just 
starting.  These  are  sometimes  called  cold  start  misses  since  the  cache  has  never  held  the 
information  requested. 

*  Capacity  -  cache  misses  that  occur  when  discarded  blocks  are  again  referenced  by  the 
executing  program.  These  misses  are  inevitable  since  the  cache  size  is  less  than  main 
memory  size. 

*  Conflict - cache misses dictated by the block placement strategy. Conflict misses occur when
a block is discarded because too many incoming blocks map to the same set and the
discarded block is soon needed. This characteristic is evident in both set-associative
and direct-mapped caches.

1. Flynn's classification (1966) is based on the multiplicity of instruction streams and data streams
in a computer system [HWANG84].
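
The miss categories above can be illustrated with a small simulation. The following is a minimal sketch, not the simulator used in this research. It models a direct-mapped cache and labels each miss either as compulsory (first reference to the block) or as a replacement miss (capacity or conflict; separating those two would additionally require a fully-associative reference model). The function name `classify_misses` and its default parameters are assumptions, chosen to resemble an 8K cache with 32-byte lines.

```python
def classify_misses(addresses, cache_bytes=8192, line_bytes=32):
    """Count hits, compulsory misses and replacement misses
    for a direct-mapped cache (illustrative sketch only)."""
    num_lines = cache_bytes // line_bytes
    lines = [None] * num_lines      # block number currently held by each line
    seen = set()                    # blocks referenced at least once
    counts = {"hit": 0, "compulsory": 0, "replacement": 0}
    for addr in addresses:
        block = addr // line_bytes
        index = block % num_lines   # direct-mapped placement
        if lines[index] == block:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1    # cold-start miss
        else:
            counts["replacement"] += 1   # block was held earlier and discarded
        seen.add(block)
        lines[index] = block
    return counts
```

For instance, the address sequence 0, 4, 8192, 0 touches two blocks that map to the same line, so the final reference to address 0 is counted as a replacement miss.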

The structure of the memory subsystem is given in Figure 1. Traversing down the hierarchy,
access time increases and storage size increases. However, bandwidth decreases significantly
while traversing the hierarchy from top to bottom. Some nominal figures for size and bandwidth are
also given in Figure 1. It is worth noting that each level contains only a subset of the information
contained in the next lower level. This presents a
constraint of maintaining coherency (correct information) throughout the hierarchy. The MPB
receives its information from the next lower level of the hierarchy. In this research, the next level
of the hierarchy is the main memory. For the development of the concept of the MPB and for most
of the simulation described here, the MPB is not involved in the write policy of the cache. The MPB
always gets its data from the main memory, which is kept up to date. Further research on the MPB
will study the implementation of a write-through policy for coherency. Write-back performance
will also be examined in follow-on research.
Figure 1: Memory Hierarchy
III. PERFORMANCE METRICS


In order to investigate the performance of the memory subsystem, characteristics of the memory
subsystem must be developed. From the system perspective, work completed in time defines
system performance. Hence, system performance can be described analytically as Equation 1.

\[
\text{System Performance} = \frac{\text{Instructions Completed}}{\text{Elapsed Time}} \tag{1}
\]


This definition of system performance yields the ubiquitous MIPS unit. This unit of
measurement should not be used to compare different systems performing the same task
[PATHEN90]. However, for characterization of a specific system performing the same task, this
unit of measure is useful. This measure of performance can be expressed in terms of processor
cycles. Efficiency is a product of the number of instructions executed, the number of clock cycles
per instruction and the clock speed (Equation 2).

\[
E = I \times CPI \times t \tag{2}
\]

Expanding this model, the number of cycles per instruction executed is the metric that is
directly influenced by the memory subsystem. Statistically, a more stable metric is the effective
CPI. The effective CPI is the statistical average of several measurements, where n is the number
of measurements. The effective CPI is

\[
CPI_{EFF} = \frac{1}{n} \sum_{i=1}^{n} CPI_{i} \tag{3}
\]

The number of cycles per instruction is largely determined by processor architecture and
register/cache structure (effectiveness). With a focus toward the memory structure, the effective access
time of the memory subsystem is the best metric to indicate memory subsystem performance. This
parameter depends on the cache access time and the main memory access time. By decreasing the
number of cycles per instruction, the system performance is improved. The speedup in system
performance is modelled by Equation 4.


\[
S = \frac{CPI_{EFF} - CPI_{EFF(MPB)}}{CPI_{EFF}} = 1 - \frac{CPI_{EFF(MPB)}}{CPI_{EFF}} \tag{4}
\]
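
As a minimal sketch of the effective CPI and of the speedup of Equation 4, the following computes an average over several CPI measurements and the resulting fractional speedup. The function names and the sample measurements are hypothetical and are not taken from this research.

```python
def effective_cpi(cpi_samples):
    """Arithmetic mean of several CPI measurements."""
    return sum(cpi_samples) / len(cpi_samples)


def cpi_speedup(cpi_eff, cpi_eff_mpb):
    """Fractional speedup as in Equation 4: 1 - CPI_EFF(MPB) / CPI_EFF."""
    return 1.0 - cpi_eff_mpb / cpi_eff


# Hypothetical measurements, for illustration only.
base = effective_cpi([1.9, 2.0, 2.1])
with_mpb = effective_cpi([1.7, 1.8, 1.9])
print(f"speedup = {cpi_speedup(base, with_mpb):.1%}")
```

A positive result indicates that the system with the MPB needs fewer cycles per instruction than the base system.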
The nominal figures for the number of cycles per instruction in high-performance processors are
1.2-2.0 CPI. If we assume that the processor can execute instructions at the bandwidth of the
memory subsystem, the speedup becomes a function of the effective access time of the memory
subsystem. Equation 5 determines the speedup of a given system by reference to the effective access
time with the MPB and without the MPB.


\[
S = 1 - \frac{T_{EA(MPB)}}{T_{EA}} \tag{5}
\]


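The effective access time measures the memory hierarchy performance. The effective access
time is, therefore, a function of the cache performance and main memory performance, as noted in
Equation 6, where T_CS is the cache tag search time, C_HR is the cache hit ratio, T_CF is the cache
fetch time and T_MR is the time to read data from main memory.

\[
T_{EA} = T_{CS} + C_{HR}\,T_{CF} + (1 - C_{HR})\,T_{MR} \tag{6}
\]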

This relationship can be simplified by noting that the time for a cache tag search is very small. In
addition, the cache tag search and cache fetch are much smaller than the time to read/fetch data
from main memory, T_MR. The effective access time can then be approximated as in Equation 7.

\[
T_{EA} \approx C_{HR}\,T_{CF} + (1 - C_{HR})\,T_{MR} \tag{7}
\]

This approximation can be used only for comparison between simulation models. The
description given by Equation 6 must be used for evaluation of the simulation model with respect to
implementation performance.
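
The effective access time and the speedup of Equation 5 can be sketched as follows. This is an illustrative calculation under stated assumptions: the function names are hypothetical, and the tag-search term defaults to zero, which is the form that reproduces the speedup figures tabulated later in this paper. The hit rates (96.51% and 97.16%, the SPICE baseline and 12K-cache results) and the 10 ns cache and 80 ns main memory timings are taken from the performance section of this paper.

```python
def effective_access_time(hit_rate, t_cache_ns, t_memory_ns, t_search_ns=0.0):
    """Effective access time: tag search, plus cache fetch weighted by the
    hit rate, plus main-memory read weighted by the miss rate."""
    return t_search_ns + hit_rate * t_cache_ns + (1.0 - hit_rate) * t_memory_ns


def speedup(t_ea, t_ea_improved):
    """Fractional speedup as in Equation 5: 1 - T_EA(improved) / T_EA."""
    return 1.0 - t_ea_improved / t_ea


base = effective_access_time(0.9651, 10.0, 80.0)      # 8K cache, SPICE
improved = effective_access_time(0.9716, 10.0, 80.0)  # 12K cache, SPICE
print(f"{100 * speedup(base, improved):.2f} %")
```

With these inputs the result agrees with the 3.66% speedup listed for SPICE with a 12K cache in Table 1.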
IV. MEMORY PREDICTION BUFFER


The memory prediction buffer (MPB) was conceived to predict the next cache-miss address
and prefetch the data before the request is made by the processor. The MPB can be inserted between
the cache and its refill line as depicted in Figure 2. Another possible configuration could be the use
of smaller MPBs attached to individual memory chips (DRAMs). This implementation is realized
in recent work by Nowicki [NOWICK92]. A block diagram of this approach is given in Figure 3.

Figure 2: MPB With Cache Implementation (block diagram labels: Central Processing Unit, Memory Subsystem, Main Memory)

Figure 3: MPB With Main Memory Implementation

In the early research of this idea, efforts turned instinctively toward statistical methods for prediction.
The area of digital signal processing was explored for possible solutions to the prediction
requirement [HAMMIN83],[THERRI92]. Kalman filters, Wiener filters and other adaptive
techniques for prediction were proposed and investigated. However, further characterization of the
problem provided more specifications for possible solutions.
Cache simulation was achieved using Mark Hill's DINERO III cache simulator. The model
cache is direct-mapped, with 8K data and 8K instruction caches and a 32-byte line size. Using
various ATUM traces [GRIMSR92] and DEC traces [BORG90], cache miss addresses were
investigated [AGARWL86]. Review of the traces shows that spatial locality and temporal locality are
valid for all processes. Since no curves are noted in the traces, prediction should employ linear
methods. The physical construction of the memory prediction buffer is given in Figure 4.

Figure 4: Memory Prediction Buffer

The simulation was configured to give the number of cache hits before a miss is encountered. The
average of these hit counts gives the constraint on the time available to predict and prefetch a miss
address. Since the average number of cache hits before a cache miss is 4-6, some 6-10
cycles may be available for prediction and prefetch. In addition, the system bus bandwidth must be
considered for any prefetch solution. These constraints were responsible for the development of a
simpler prediction algorithm. The prediction algorithm yields a bias for the ensuing prefetch. The
algorithm is implemented in C for simulation.
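
The measurement of cache hits between misses can be sketched as below. This is a simplified stand-in for the instrumented cache simulation, assuming a direct-mapped cache fed by a plain list of addresses. The function names are hypothetical, and the original C code and the DINERO III trace format are not reproduced here.

```python
def hits_between_misses(addresses, cache_bytes=8192, line_bytes=32):
    """Return the length of the run of hits that precedes each miss in a
    direct-mapped cache (illustrative sketch only)."""
    num_lines = cache_bytes // line_bytes
    lines = [None] * num_lines   # block number held by each cache line
    runs, current = [], 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % num_lines
        if lines[index] == block:
            current += 1             # another hit in the current run
        else:
            runs.append(current)     # a miss closes the current run
            current = 0
            lines[index] = block
    return runs


def average(values):
    """Mean of a sequence, or 0.0 when it is empty."""
    return sum(values) / len(values) if values else 0.0
```

Applied to a purely sequential stream of 4-byte words, every miss after the first is preceded by seven hits, because eight words share one 32-byte line. Real traces give shorter and less regular runs.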

If the current address is larger than the past address, then the bias is positive (negative
otherwise). The algorithm for the MPB is given in Figure 5. The determination and application of
the bias is central to the algorithm. The bias is simply the difference in address boundaries (if word
aligned) of the previous address and the current address. If the address requested is greater than 32K
away, another address stream bias is established. The corresponding address stream bias is used to
predict the next requested address. The bias may be positive or negative, that is, ascending or
descending in memory. The correct address stream bias is determined using a simple but fast binary
search. The search time can be reduced further using a fully associative algorithm.

    receive address request from processor
    determine block address (boundary)
    is the address [...]?
    send requested data to processor
    compare address requested with previous address request and calculate bias
    apply bias to last address to obtain predicted address
    fetch data at predicted address

Figure 5: Memory Prediction Buffer Algorithm
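
The bias calculation described above can be sketched as follows. This is a minimal illustration under several assumptions, not the C implementation used for the simulation. The class name `PredictionSketch` and its methods are hypothetical, streams are located with a linear scan rather than the binary search mentioned in the text, and the set of prefetched blocks is unbounded (no replacement policy is modelled).

```python
LINE_BYTES = 32             # block size, matching the first-level cache
STREAM_WINDOW = 32 * 1024   # addresses farther apart start a new stream


class PredictionSketch:
    """Illustrative bias predictor for cache-miss addresses."""

    def __init__(self):
        self.streams = []        # each entry: [last_block_address, bias]
        self.prefetched = set()  # block addresses fetched ahead of use

    def access(self, address):
        """Process one cache-miss address; return True on a buffer hit."""
        block = address - (address % LINE_BYTES)   # block boundary
        hit = block in self.prefetched
        stream = self._find_stream(block)
        if stream is None:
            # More than 32K from every known stream: establish a new one.
            self.streams.append([block, 0])
            return hit
        bias = block - stream[0]   # positive = ascending, negative = descending
        stream[0], stream[1] = block, bias
        if bias != 0:
            self.prefetched.add(block + bias)   # prefetch the predicted block
        return hit

    def _find_stream(self, block):
        # Linear scan for clarity; the text describes a binary search.
        for stream in self.streams:
            if abs(block - stream[0]) <= STREAM_WINDOW:
                return stream
        return None
```

Feeding the miss addresses 0, 32, 64, 96 through `access` gives two misses while the bias is being learned, followed by hits on the prefetched blocks. A second, distant run of descending addresses is tracked as its own stream with a negative bias.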

The structure of the memory prediction buffer is similar to a conventional fully-associative
cache. The MPB is composed of m lines of n-byte blocks. For the cache used in this research, the
MPB has 16-256 lines of 32-byte blocks. The blocks are aligned on the same address (word)
boundaries as the first-level cache. The block size is dependent on the block size of the first-level
cache. The optimal size of the MPB is 64-256 lines. This size is due to the fan-out requirements (and
costs) for the construction of a fully-associative cache and the number of lines (sets) needed to allow
effective use of the replacement policy used (random replacement rather than LRU, FIFO, etc.). If an
LRU replacement policy is used instead of random replacement, a smaller MPB can be used to give the
same performance improvement.
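
The difference between the two replacement policies can be sketched with a small fully-associative buffer. This is an illustration only: the class name `AssociativeBuffer`, its parameters and the seeded random generator are assumptions made for the example and do not describe the hardware organization.

```python
import random
from collections import OrderedDict


class AssociativeBuffer:
    """Fully-associative buffer of `num_lines` blocks (sketch only)."""

    def __init__(self, num_lines=128, policy="random", seed=0):
        self.num_lines = num_lines
        self.policy = policy              # "random" or "lru"
        self.blocks = OrderedDict()       # block address -> None, oldest first
        self.rng = random.Random(seed)    # seeded for repeatable runs

    def lookup(self, block):
        """Return True on a hit; under LRU a hit refreshes the block."""
        if block not in self.blocks:
            return False
        if self.policy == "lru":
            self.blocks.move_to_end(block)
        return True

    def insert(self, block):
        """Store a prefetched block, evicting one line if the buffer is full."""
        if block in self.blocks:
            return
        if len(self.blocks) >= self.num_lines:
            if self.policy == "lru":
                self.blocks.popitem(last=False)   # least recently used
            else:
                victim = self.rng.choice(list(self.blocks))
                del self.blocks[victim]
        self.blocks[block] = None
```

Under LRU a block that was just referenced survives the next eviction, whereas random replacement may discard it. This is one plausible reason why fewer lines suffice with LRU.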
V. MEMORY PREDICTION BUFFER PERFORMANCE


A. MPB THEORETICAL PERFORMANCE

The memory prediction buffer determines the future cache miss address using previous cache
miss addresses. For this analysis, only the data cache is given an MPB. The instruction cache is set
to prefetch instructions. Given a model cache with a hit ratio of 93.2%, if the MPB is found to be
correct on 33% of its predictions, an increase of 2.1% is realized for the cache hit rate. The effective
cache hit ratio is improved from 93.2% to 95.3%. The graph of Figure 6 gives the effective cache
hit rate as a function of MPB effectiveness.

Figure 6: MPB Performance Graph

There are four cache models that are compared. One
model has an 80% initial hit rate, another model has an 83% hit rate, and so on. A sample reading is
shown for a base cache hit ratio of 80% with an MPB effectiveness rating of 20%. The resulting
effective cache hit ratio for this sample is 84%. This is an increase of 4% in the effective cache hit
ratio. The resulting system performance achieves a speedup of 9%.

The model system for this investigation has 10 ns cache memory and 80 ns main memory. This
model memory hierarchy is also used by the simulation study. The cycle time of the main memory
is not considered but would add to the effectiveness of the MPB.
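
The sample reading can be checked with a one-line model. The relationship assumed here, that the MPB converts a fraction of the cache misses into hits, is not stated explicitly in the text. It is inferred from the 80%/20%/84% sample reading and is consistent with the hit rates in the tables of this section. The function name is hypothetical.

```python
def effective_hit_rate(cache_hit_rate, mpb_effectiveness):
    """Combined hit rate when the MPB catches a fraction of the cache misses."""
    return cache_hit_rate + mpb_effectiveness * (1.0 - cache_hit_rate)


print(round(100 * effective_hit_rate(0.80, 0.20), 2))
```

An 80% base hit ratio leaves 20% misses; catching 20% of those adds 4 percentage points, which matches the 84% sample reading.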
B. BASELINE SYSTEM PERFORMANCE


In order to compare the performance of the MPB to existing latency reduction strategies,
several measurements of the baseline system had to be collected and examined. This baseline
system was constructed using the cache simulator DINERO III. The system simulates separate 8K
direct-mapped data and 8K direct-mapped instruction caches.


Table 1: BASELINE SYSTEM PERFORMANCE

Process    Cache Size (bytes)    HRl (%)    HRc (%)    HRsys (%)    Speedup (%)

8K FIRST LEVEL CACHE BASE-SYSTEM PERFORMANCE
SPICE       8192                 96.51      96.51      96.51        -0-
Pascal      8192                 91.57      91.57      91.57        -0-
LISP        8192                 92.44      92.44      92.44        -0-
FORTRAN     8192                 93.88      93.88      93.88        -0-
Tree        8192                 98.66      98.66      98.66        -0-
SOR         8192                 90.50      90.50      90.50        -0-

12K FIRST LEVEL CACHE PERFORMANCE
SPICE      12288                 97.16      97.16      97.16        3.66
Pascal     12288                 94.40      94.40      94.40        12.46
LISP       12288                 96.32      96.32      96.32        17.76
FORTRAN    12288                 95.11      95.11      95.11        6.03
Tree       12288                 97.43      97.43      97.43        (-7.87)
SOR        12288                 91.16      91.16      91.16        2.77

8K FIRST LEVEL CACHE (DM) WITH 4K SECOND LEVEL CACHE (FA)
SPICE       4096                 24.46      96.51      97.37        4.84
Pascal      4096                 36.91      91.57      94.68        13.69
LISP        4096                 75.59      92.44      98.16        26.18
FORTRAN     4096                 32.58      93.88      95.81        9.46
Tree        4096                 68.32      98.56      99.44        4.99
SOR         4096                 23.84      90.50      92.77        9.54

C.  MPB  SIMULATION  PERFORMANCE 


The theoretical study of the MPB was realized using trace-driven
simulation (TDS) [GRIMSR92] with the DINERO III cache simulator (provided by Mark Hill). As
with any TDS research, address traces and their accuracy are critical to proper simulation. For this
research, ATUM traces [AGARWL86] and DEC Titan traces [BORG90] were used. Some
behavioral characteristics of the simulation are graphically illustrated in the appendix. Table 2 gives
a summary of MPB performance for two processes and two runs of each.

Table 2: MEMORY PREDICTION BUFFER PERFORMANCE (DEC)

Process    MPB Lines    Bytes per line    HRmpb (%)    HRc (%)    HRsys (%)    Speedup (%)
TREE 1     128          32                69.89        97.87      99.37        9.14
TREE 2     128          32                59.57        98.01      99.20        7.31
SOR 1      128          32                12.77        90.51      91.79        5.38
SOR 2      128          32                10.20        90.29      91.35        4.42

SOR is Renato De Leone's
successive over-relaxation algorithm that uses sparse matrices. TREE is Joel Bartlett's program
which builds a tree data structure and searches for the largest element in the tree. His program is
written in a variant of LISP. Both of these process traces were provided by DEC WRL. The model
system is a RISC processor with separate 8K instruction and 8K data caches. There are 32-byte
blocks in the cache and in the MPB. The cache is direct-mapped for reasons given by [HILL88]. The
initial cache hit rate, before the insertion of the MPB, is listed under HRc. The local hit rate for the
MPB is given under HRmpb. The overall hit rate for the cache and MPB combined is listed under
HRsys. The speedup is listed for the overall system. For these examples, the MPB consists of
128 lines of 32-byte blocks. Each line is boundary aligned in the same way as the cache. That is, just
as the cache may use word-aligned blocks, so does the MPB. This MPB simulation used a random
replacement policy for the removal of lines. Toward the end of this research effort, an MPB was
simulated using a least-recently-used (LRU) replacement policy. Several simulations using this
replacement policy showed that the number of lines in the MPB could be reduced while maintaining
the effectiveness of the MPB. In particular, 64 lines were shown to perform nearly as well as 128
lines. For the simulation results of Table 2, the speedup numbers are modest, but the cost of this
implementation is minimal when compared to a 256K next-level cache [PATHEN90].
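
As a worked check of the TREE 1 row of Table 2, the figures can be combined as follows. The calculation assumes the 10 ns cache and 80 ns main memory of the model system, neglects the tag-search time, and assumes that a hit in the MPB costs the same as a hit in the cache. The last assumption is not stated in the text, but it reproduces the tabulated speedup.

```latex
\begin{align*}
HR_{sys} &\approx 0.9787 + 0.6989\,(1 - 0.9787) \approx 0.9936
  && \text{(tabulated: } 99.37\%\text{)} \\
T_{EA} &= 0.9787 \times 10 + (1 - 0.9787) \times 80 = 11.491\ \text{ns} \\
T_{EA(MPB)} &= 0.9937 \times 10 + (1 - 0.9937) \times 80 = 10.441\ \text{ns} \\
S &= 1 - \frac{10.441}{11.491} \approx 0.0914
  && \text{(tabulated: } 9.14\%\text{)}
\end{align*}
```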

In addition to the simulations using the DEC traces, simulations were also done using ATUM
traces. Table 3 lists results of simulation using ATUM traces. The model system is the same as used
in the DEC trace simulation.

Table 3: MEMORY PREDICTION BUFFER PERFORMANCE (ATUM)

Process    MPB Lines    Bytes per line    HRmpb (%)    HRc (%)    HRsys (%)    Speedup (%)
Spice      128          32                33.50        93.22      95.27        6.75
Pascal     128          32                47.35        95.62      97.45        9.80
LISP       128          32                69.75        92.68      97.72        23.33
FORTRAN    128          32                40.11        94.22      96.90        13.36

These simulation results can be used to motivate further research.
ATUM traces are relatively short for cache modelling and behavior analysis. Each trace is
approximately 400,000 addresses. This number of addresses is marginally adequate for a 32K cache
simulation, and larger cache-size simulation would require a larger number of addresses for proper
and accurate simulation.

For the preceding research, a random-replacement policy was used by the MPB. An early
implementation of the MPB using a least-recently-used (LRU) policy shows improved performance
over the random-replacement algorithm. Table 4 lists the results of this research using the process
"tree". Results of this implementation using other processes were not yet available at the time
of the report.

Table 4: MEMORY PREDICTION BUFFER PERFORMANCE (LRU)

Process    MPB Lines    Bytes per line    HRmpb (%)    HRc (%)    HRsys (%)    Speedup (%)
TREE       128          32                79.11        97.91      99.98        12.64

As evidenced by all these simulation studies, the MPB is shown to be a favorable
architectural concept for consideration in systems where the highest possible performance is desired
and system costs are constrained.
VI. CONCLUSIONS


The memory prediction buffer is proposed as a component for high-performance computer
systems. The widening gap between processor speed and memory subsystem speed requires the
investigation of alternative architectures for reducing main-memory latency while restraining costs.
The MPB outperforms prefetch-always strategies by allowing addressing in both the ascending and
descending directions. In addition, the MPB does not contribute to pollution of the cache. Effective
memory latency reduction must be addressed at the time of system design. In addition, as the
requirements for a larger address space grow, memory hierarchy design and implementation will
continue to increase in complexity. The implementation of an MPB is less expensive than a next-level
cache and delivers a comparable performance enhancement. In addition, the algorithm used can be
tailored to the proposed system environment to provide a more effective latency reduction structure.
The MPB is shown to improve overall system performance and provide reasonable gains in speedup.
VII. RECOMMENDATIONS FOR FUTURE RESEARCH


The memory prediction buffer is studied and simulated for enhancement of the data cache of
a uniprocessor. Its use or enhancement in a multiprocessor environment is not yet known. In
addition, the question of whether the MPB can be used to significantly enhance the performance of
the instruction cache has not been fully explored. The algorithm for the MPB of this research
focused on a random replacement policy for discarding lines. The LRU replacement policy showed
an improvement over random replacement; however, the effect of other replacement policies remains
to be studied. Simulation and study of the memory bandwidth required to support an architecture
with and without an MPB is needed. A comparison of the amount of bandwidth required by the
base architecture (cache and processor) with the bandwidth required by the architecture with an MPB
installed would be useful. The cache write-back policy and its effect on system performance with and
without an MPB is an area open for study.


APPENDIX 
[Plot: Instruction and Data reads from memory. Axes: Memory Address Value (decimal) vs. Memory Read Access Sequence.]

[Plot: LISP Cache Miss Addresses (data only, region 1.910G-1.915G). Axes: Address Value (decimal) vs. Memory Sequence Number.]

[Plot: LISP Cache Miss Addresses (data only, region 1.9118G-1.9120G). Axes: Memory Address Value (decimal) vs. Memory Sequence Number.]

[Plot: LISP Cache Miss Addresses (data only). Axes: Memory Address Value (decimal) vs. Memory Read Sequence Number.]

[Plot: SOR Algorithm Cache Miss Address Stream (instr/data). Axes: Address Value (decimal) vs. Memory Sequence Number.]